Abstract

Non-negative matrix factorization (NMF) has previously been shown to be a
useful decomposition for multivariate data. We interpret the factorization in a new
way and use it to generate missing attributes from test data. We provide a joint
optimization scheme for the missing attributes as well as the NMF factors. We
prove the monotonic convergence of our algorithms. We present classification
results for cases with missing attributes.



1 Introduction

Nonnegative matrix factorization (NMF) has recently been shown to be useful for many
applications in environment, pattern recognition, multimedia, text mining, and DNA gene
expression [2, 11, 12, 10]. NMF can be traced back to the 1970s (notes from G. Golub) and has been
studied extensively by Paatero [12]. The work of Lee and Seung [8, 9] brought much attention to NMF
in the machine learning and data mining fields. Various extensions and variations of NMF have been
proposed recently [3, 4, 5, 1, 13]. NMF, in its most general form, can be described by the following
factorization



$$X^{d \times N} \approx W^{d \times r} H^{r \times N} \tag{1}$$

where d is the dimension of the data, N is the number of data points (usually more than d) and r < d.
Generally, this factorization has been compared with data decomposition techniques. In this sense
W is called the set of basis functions and H is the set of data-specific weights. It has been claimed
by numerous researchers that such a decomposition has some favorable properties over other similar
decompositions, such as PCA.
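As an illustration of the factorization above, the following is a minimal sketch of the standard multiplicative updates of Lee and Seung for the squared-error objective. It is not the authors' code: the function name `nmf`, the random initialization, the iteration count, and the small constant `eps` that guards against division by zero are all assumptions made for the example.

```python
import numpy as np

def nmf(X, r, n_iter=500, eps=1e-12, seed=0):
    """Sketch of Lee-Seung multiplicative updates for X ~ W H with W, H >= 0.

    X is a non-negative (d, N) matrix whose columns are data points.
    Returns W (d, r), H (r, N) and the error 0.5 * ||X - W H||^2 per iteration.
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    W = rng.random((d, r)) + eps  # assumed random non-negative initialization
    H = rng.random((r, N)) + eps
    errors = []
    for _ in range(n_iter):
        # Elementwise multiplicative updates; eps keeps denominators positive.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(0.5 * np.linalg.norm(X - W @ H) ** 2)
    return W, H, errors

# Toy data: d = 5 attributes, N = 40 points, r = 3 < d.
X = np.random.default_rng(1).random((5, 40))
W, H, errors = nmf(X, r=3)
```

The same function can be called with r = d or r > d, which are the regimes discussed in the rest of the paper.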

In the vast literature in this area, the parameter r largely goes unnoticed. We pose
the question: what are the fundamental differences in the decomposition for the three cases r < d,
r = d and r > d? The NMF decomposition for r < d can be viewed as an energy compaction
process, and as such only basis vectors with higher energy remain in the decomposition. For the case
r = d, we can think of W as some sort of rotation in d dimensions, and as such the locally linear
attributes of the data are preserved, as can be verified by finding the indices of the nearest neighbors
of each data point in X as well as in H.
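The nearest-neighbor check mentioned above could be carried out along the following lines. This is only a sketch: the helper names `nearest_neighbor_indices` and `neighbor_agreement` are hypothetical, Euclidean distance is assumed, and data points are assumed to be the columns of X and H.

```python
import numpy as np

def nearest_neighbor_indices(A):
    """For each column of A, return the index of its nearest other column."""
    sq = np.sum(A * A, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    return np.argmin(dist2, axis=1)

def neighbor_agreement(X, H):
    """Fraction of points whose nearest neighbor is the same in X and in H."""
    same = nearest_neighbor_indices(X) == nearest_neighbor_indices(H)
    return float(np.mean(same))

# Sanity check on toy data: a uniformly rescaled copy keeps every neighbor.
X_demo = np.random.default_rng(0).random((4, 30))
agreement = neighbor_agreement(X_demo, 2.0 * X_demo)
```

An agreement close to 1 between X and the weights H obtained with r = d would support the claim that local neighborhoods are preserved; the sketch itself makes no statement about what value real data produces.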

Now the remaining question is what happens for r > d. It is at this juncture that we want to
concentrate our research and draw meaningful conclusions from experimental as well as empirical
analysis. To develop a superficial motivation we look into the literature on sparse coding [11, 6]. The
basic idea we borrow from it is that r need not be limited by the dimensionality
of the data. The similarity has been shown to be even greater if an additional sparseness constraint
is introduced into the optimization framework [7]. We motivate our analysis from a classification
point of view. In the actual application domain we would like to handle missing attributes.



2 Additive NMF 



In this section we introduce the idea of additive NMF (ANMF), which can be motivated by
the following scenario. Assume that, given a non-negative matrix X, we have run the NMF
algorithm for a long time but, due to the inherently sub-optimal nature of the NMF algorithm,
we have only converged to a local optimum. Now we can look at the residue matrix
$R_1 = X - WH$ and perform the decomposition again, so that we find $R_1 \approx W_1 H_1$.
By coupling the sub-optimality conditions on the original and the second decomposition, we
can claim that $\|R_1\| > \|R_1 - W_1 H_1\|$. This leads us to the generic ANMF formulation



$$X \approx \sum_{i=1}^{k} W_i H_i \tag{2}$$

$$\text{s.t.} \quad W_i, H_i \geq 0 \quad \forall i \tag{3}$$



This decomposition is equivalent to the standard NMF for k = 1. Given such a
formulation we can write the update equations in one of two ways.




Figure 1: Error norm after 50000 iterations. The plots 
show only last 500 error norms. Black: first decom- 
position (traditional NMF), red: second decomposition 
and green: final decomposition. The last scheme ran for 
only 3000 iterations for convergence {err < 10 ~^). 



2.1 Multi-NMF updates 

This scheme essentially means that we employ NMF updates for each value of i on the residue
obtained from all the previous values, namely 1 to i - 1. The error values for k = 3 on the
S-curve data are shown in Fig. 1.

2.2 ANMF updates 

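Proceeding in a way similar to Lee and Seung [9], we can write an update scheme for ANMF.
The additive (gradient descent) update for $H_j$ can be written as

$$H_j \leftarrow H_j + \eta_j \circ \left( W_j^T X - W_j^T \sum_i W_i H_i \right)$$

where we have dropped the index k for simplicity. Substituting

$$\eta_j = \frac{H_j}{W_j^T \left( \sum_i W_i H_i \right)}$$

leads to a simple multiplicative update for H, and an analogous scheme for W:

$$H_j^{n+1} = H_j^n \circ \frac{W_j^T X}{W_j^T \sum_i W_i H_i^n}, \qquad W_j^{n+1} = W_j^n \circ \frac{X H_j^T}{\left( \sum_i W_i^n H_i \right) H_j^T} \tag{4}$$

Here $\circ$ and the fraction bar denote elementwise multiplication and division.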
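The multiplicative ANMF updates of Eqn. (4) can be sketched as follows. This is an illustrative reading of the update rules rather than the authors' implementation: the function name `anmf`, the random initialization, the choice to update all $H_j$ simultaneously and then all $W_j$ simultaneously, and the constant `eps` are assumptions.

```python
import numpy as np

def anmf(X, r, k, n_iter=300, eps=1e-12, seed=0):
    """Sketch of additive NMF: X ~ sum_i W_i H_i with all W_i, H_i >= 0.

    Each W_i is (d, r) and each H_i is (r, N). Returns the factor lists and
    the error 0.5 * ||X - sum_i W_i H_i||^2 after every iteration.
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    Ws = [rng.random((d, r)) + eps for _ in range(k)]
    Hs = [rng.random((r, N)) + eps for _ in range(k)]
    errors = []
    for _ in range(n_iter):
        S = sum(W @ H for W, H in zip(Ws, Hs))  # current sum_i W_i H_i
        for W, H in zip(Ws, Hs):
            H *= (W.T @ X) / (W.T @ S + eps)    # update every H_j in place
        S = sum(W @ H for W, H in zip(Ws, Hs))
        for W, H in zip(Ws, Hs):
            W *= (X @ H.T) / (S @ H.T + eps)    # update every W_j in place
        S = sum(W @ H for W, H in zip(Ws, Hs))
        errors.append(0.5 * np.linalg.norm(X - S) ** 2)
    return Ws, Hs, errors

# Toy data: d = 6, N = 50, with k = 3 terms of rank r = 3 each.
X_toy = np.random.default_rng(2).random((6, 50))
Ws, Hs, anmf_errors = anmf(X_toy, r=3, k=3)
```

With simultaneous updates, this scheme coincides with running standard NMF on the stacked factors $[W_1 \dots W_k]$ and $[H_1; \dots; H_k]$, which is one way to see why the error does not increase.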



2.2.1 Convergence sketch 

Convergence of the ANMF scheme can be proved in the same manner as done by Lee and Seung [9].
The auxiliary function $G(h, h^t)$ remains exactly the same in form; the only difference is
the first-order derivative, which in our case is

$$\nabla F(h^t) = -W_j^T \left( X - \sum_i W_i H_i \right)$$

The minimizer of the auxiliary function can now be shown to coincide with the update rules
given in Eqn. (4).



3 ANMF for Missing Attributes 

The training data is used to learn W, with r > d. This can be viewed as developing an over-complete
dictionary from the data. The hope is that this over-complete dictionary will encode enough
information to guess the values of missing attributes, which can then be used for classification. The
same procedure with r < d has no guarantee of encoding extra information, since the matrix W is
rank-limited by r, and hence removing a row of W (one of the d attributes) might eliminate a rank
dimension. The basic idea is that since NMF results in a decomposition into a feature-dependent
term (W) and a data-dependent term (H), we can remove the particular row of W for which we do not
have information and still generate a good estimate of the data-dependent term H for the data point
with missing attributes. A simple multiplication with the whole W then gives the approximation for the
missing attributes.

The generic data imputation based classification algorithm is as follows: 

• Training: Assume labeled training data without missing attributes and find the decomposition
$X_{tr} \approx W_{tr} H_{tr} = \hat{X}_{tr}$.

• Now, keeping the same $W_{tr}$, find the decomposition $X_{te} M_{te} \approx W_{tr} H_{te} M_{te}$. The mask
$M_{te}$ is placed to zero out the rows of $W_{tr}$ corresponding to the missing attributes. Finally,
the joint estimate for the missing attributes can be obtained from $W_{tr} H_{te} = \hat{X}_{te}$.

• Learn a classifier for $X_{tr}$. Generate the classification results for $X_{te}$.

Some advantages of this approach are that the training data decomposition can be done
offline once, and the learned set of basis functions $W_{tr}$ can then be used for the test data
transformation. Also, the classification engine does not need to perform any additional task, because
we convert the data back to its original dimension.

4 Algorithm Details 

From here on, we work on a single test point $x \in \mathbb{R}^d$ and present all
the analysis for that single point. The extension to multiple points X is straightforward. We also
assume, WLOG, that the last attribute $x_d$ is the missing attribute, and follow the notation given
in Eqn. (5) for the rest of the development.

$$x = \begin{bmatrix} \bar{x} \\ x_d \end{bmatrix}, \qquad W = \begin{bmatrix} \bar{W} \\ W_d \end{bmatrix} \tag{5}$$

The optimization scheme for the observed part, $\bar{x}$, can now be written as

$$\bar{x} \approx \bar{W} h \tag{6}$$

The update equations can be obtained directly from the update rules of Lee and Seung [9]. Once the
iterations have converged we can find the missing attribute from the projection $x_d = W_d h$.

Theorem 1. The squared error

$$\tfrac{1}{2} \| x - W h \|^2$$

is non-increasing under the following updates

$$x_d^{n+1} = W_d h_n, \qquad h_{n+1} = h_n \circ \frac{\bar{W}^T \bar{x}}{\bar{W}^T \bar{W} h_n} \tag{7}$$

where $\bar{x}$ and $\bar{W}$ are as defined in Eqn. (5).

Proof. The squared error can be written as

$$\min_{x_d, h} F(x_d, h) = \tfrac{1}{2} \left( \| x_d - W_d h \|^2 + \| \bar{x} - \bar{W} h \|^2 \right) \tag{8}$$

Writing the first-order derivatives with respect to $x_d$ and h, and equating the first to zero, we get

$$\nabla_{x_d} F(x_d, h) = (x_d - W_d h) = 0 \tag{9}$$

$$\nabla_h F(x_d, h) = -W_d^T (x_d - W_d h) - \bar{W}^T (\bar{x} - \bar{W} h) = -\bar{W}^T (\bar{x} - \bar{W} h) \tag{10}$$

where the last equality uses Eqn. (9). The update for $x_d$ is simply obtained from Eqn. (9).
Eqn. (10) suggests that the update for h can now be obtained by solving the reduced system

$$\min_h F(h) = \tfrac{1}{2} \| \bar{x} - \bar{W} h \|^2 \tag{11}$$

which is the same as Eqn. (6), for which the optimum non-negative, non-increasing update has been
shown to be the same as Eqn. (7) [9]. □
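The single-point procedure of Theorem 1 can be sketched as below. This is a hedged illustration, not the authors' code: the function name `impute_point`, the boolean `missing` mask (which generalizes the "last attribute" convention to any subset of attributes), the random initialization of h, and the constant `eps` are assumptions. The basis W is taken as given and is not updated.

```python
import numpy as np

def impute_point(W, x, missing, n_iter=500, eps=1e-12, seed=0):
    """Sketch: estimate missing entries of x from a fixed non-negative basis W.

    W is (d, r), x is (d,), and missing is a boolean (d,) mask that is True
    where the attribute is unknown. Only the observed rows of W drive the
    update of h; the missing entries are then read off the remaining rows.
    """
    observed = ~missing
    W_obs, x_obs = W[observed], x[observed]  # observed rows and entries
    W_mis = W[missing]                       # rows for the missing attributes
    h = np.random.default_rng(seed).random(W.shape[1]) + eps
    errors = []
    for _ in range(n_iter):
        h *= (W_obs.T @ x_obs) / (W_obs.T @ (W_obs @ h) + eps)
        errors.append(0.5 * np.sum((x_obs - W_obs @ h) ** 2))
    x_hat = x.copy()
    x_hat[missing] = W_mis @ h               # projection onto missing rows
    return x_hat, h, errors

# Toy example with an over-complete basis (r = 8 > d = 4).
rng = np.random.default_rng(3)
W_demo = rng.random((4, 8))
x_true = W_demo @ rng.random(8)
missing = np.array([False, False, False, True])
x_in = np.where(missing, 0.0, x_true)        # missing attribute set to zero
x_hat, h, point_errors = impute_point(W_demo, x_in, missing)
```

The recorded errors track only the observed part, which is the reduced objective solved in the proof; the sketch makes no claim about how close the imputed value is to the true one, since that depends on how well W was learned.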

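A similar extension can now be applied to the ANMF scheme, which leads us to the following claim:

Claim 1. The squared error

$$\tfrac{1}{2} \Big\| x - \sum_{i=1}^{k} W_i h_i \Big\|^2$$

is non-increasing under the following updates

$$x_d^{n+1} = \sum_{i=1}^{k} W_{i,d} h_i^n, \qquad h_j^{n+1} = h_j^n \circ \frac{\bar{W}_j^T \bar{x}}{\bar{W}_j^T \sum_{i=1}^{k} \bar{W}_i h_i^n} \tag{12}$$

where $\bar{x}$ is the observed part of x as defined in Eqn. (5), and $\bar{W}_i$ and $W_{i,d}$ (the observed rows
and the missing-attribute row of $W_i$) are defined analogously to $\bar{W}$ and $W_d$.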

5 Experiments 

First we present results on manifold data, as shown in Fig. 2. As can be seen from the results, dots
of a given color that have one axis value artificially set to zero are pulled closer to the same-color
data points on the true manifold.

Figure 2: Left: input data with missing attributes in one dimension only; right: our result.



Next we present results for the WDBC data from the UCI Machine Learning Repository. The data is
represented as 30-dimensional vectors, with 2 possible classes. There are 569 data points in total. We
randomly select about 80% of the data as training data and the rest as testing data.



The baseline performance denotes the classification accuracy with complete data. We introduce
the missing attributes in the following way: for each test data point we generate a 30-dimensional
random vector $R \in (0, 1)^{30}$. All the indices in the vector R having values less than a threshold
t = 0.3 are marked for deletion. All marked indices are subsequently replaced by zeros in the test
data point. This process is repeated for the entire test data set.
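The deletion procedure described above can be sketched as follows. The function name `zero_out_missing`, the column-per-point layout, and the use of a uniform random generator are assumptions for illustration; note that NumPy draws from [0, 1) rather than the open interval.

```python
import numpy as np

def zero_out_missing(X_test, t=0.3, seed=0):
    """Sketch: mark entries for deletion and replace them with zeros.

    X_test is (d, N) with one test point per column. For every point a
    uniform random vector R is drawn; indices with R < t are marked missing.
    Returns the zero-filled data and the boolean mask of deleted entries.
    """
    rng = np.random.default_rng(seed)
    R = rng.random(X_test.shape)           # one random vector per test point
    missing = R < t                        # indices marked for deletion
    X_zeroed = np.where(missing, 0.0, X_test)
    return X_zeroed, missing

# Toy stand-in for a 30-dimensional test set of 114 points (strictly positive).
X_test_demo = np.random.default_rng(4).random((30, 114)) + 0.1
X_zeroed, missing_mask = zero_out_missing(X_test_demo, t=0.3)
```

With t = 0.3, roughly 30% of the entries of each test point are removed on average, which matches the "Missing 30%" setting reported in the comparison.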

The comparison is shown in the following table.

Dataset   Baseline   Missing 30% (zero substitution)   NMF with missing
WDBC      97         86.95                             91.91
Ion       85.91      73.23                             76.05
Pima      76.67      69.48                             70.12
Echo      88.89      77.78                             88.89



In the next experiment we guess the value of the missing attributes in one of the following ways:
zero substitution, mean substitution, and random substitution. The results are shown in the following
table. All the results are for the WDBC dataset (base accuracy 93.07%).



% missing   Zero    Mean    Random   NMF
10          92.17   91.30   92.17    95.17
20          89.57   85.22   89.57    93.30
30          81.74   68.70   81.74    91.30
40          80.35   64.16   80.35    87.61
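The three baseline substitutions compared in this table can be sketched as follows. The paper does not spell out their details, so this sketch rests on assumptions that are labeled in the code: "mean" is taken to be the per-attribute mean of the training data, and "random" is taken to be a uniform draw from [0, 1). The function name `substitute` is hypothetical.

```python
import numpy as np

def substitute(X_test, missing, method, X_train=None, seed=0):
    """Sketch: fill the missing entries of X_test with a simple guess.

    X_test and missing are (d, N); X_train is (d, N_train) and is only
    needed for method == "mean".
    """
    X_filled = X_test.copy()
    if method == "zero":
        X_filled[missing] = 0.0
    elif method == "mean":
        # Assumption: use the per-attribute mean of the training data.
        means = X_train.mean(axis=1, keepdims=True)
        X_filled = np.where(missing, means, X_filled)
    elif method == "random":
        # Assumption: uniform random values in [0, 1).
        rng = np.random.default_rng(seed)
        X_filled[missing] = rng.random(int(missing.sum()))
    else:
        raise ValueError(f"unknown method: {method}")
    return X_filled

# Toy data: 5 attributes, 20 training points, 8 test points.
rng = np.random.default_rng(5)
X_train_demo = rng.random((5, 20))
X_te_demo = rng.random((5, 8))
mask_demo = rng.random((5, 8)) < 0.3
filled_zero = substitute(X_te_demo, mask_demo, "zero")
filled_mean = substitute(X_te_demo, mask_demo, "mean", X_train=X_train_demo)
filled_rand = substitute(X_te_demo, mask_demo, "random")
```

Each filled matrix has the original dimension, so it can be passed to the same classifier that was trained on complete data.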